Abstract



This paper describes serial and parallel compositional models of multiple objects 
with part sharing. Objects are built by part-subpart compositions and expressed 
in terms of a hierarchical dictionary of object parts. These parts are represented 
on lattices of decreasing sizes which yield an executive summary description. We 
describe inference and learning algorithms for these models. We analyze the complexity
of these models in terms of computation time (for serial computers) and numbers
of nodes (e.g., "neurons") for parallel computers. In particular, we compute
the complexity gains by part sharing and its dependence on how the dictionary
scales with the level of the hierarchy. We explore three regimes of scaling behavior
where the dictionary size (i) increases exponentially with the level, (ii) is
determined by an unsupervised compositional learning algorithm applied to real 
data, (iii) decreases exponentially with scale. This analysis shows that in some 
regimes the use of shared parts enables algorithms which can perform inference 
in time linear in the number of levels for an exponential number of objects. In 
other regimes part sharing has little advantage for serial computers but can give 
linear processing on parallel computers. 



1 Introduction 

A fundamental problem of vision is how to deal with the enormous complexity of images and visual
scenes.¹ The total number of possible images is almost infinitely large [8]. The number of objects
is also huge and has been estimated at around 30,000 [2]. How can a biological, or artificial, vision
system deal with this complexity? For example, considering the enormous input space of images
and output space of objects, how can humans interpret images in less than 150 msec [16]?

There are three main issues involved. Firstly, how can a visual system be designed so that it can 
efficiently represent large classes of objects, including their parts and subparts? Secondly, how can 
the visual system be designed so that it can rapidly infer which object, or objects, are present in an 
input image and the positions of their subparts? And, thirdly, how can this representation be learnt 
in an unsupervised, or weakly supervised fashion? In short, what visual architectures enable us to 
address these three issues? 

Many considerations suggest that visual architectures should be hierarchical. The structure of
mammalian visual systems is hierarchical with the lower levels (e.g., in areas V1 and V2) tuned to small
image features while the higher levels (i.e., in area IT) are tuned to objects.² Moreover, as
appreciated by pioneers such as Fukushima [5], hierarchical architectures lend themselves naturally to
efficient representations of objects in terms of parts and subparts which can be shared between many
objects. Hierarchical architectures also lead to efficient learning algorithms as illustrated by deep
belief learning and others [7]. There are many varieties of hierarchical models which differ in details
of their representations and their learning and inference algorithms (e.g., [5], [10]).
But, to the best of our knowledge, there has been no detailed study of their complexity properties.

1 Similar complexity issues will arise for other perceptual and cognitive modalities.

2 But just because mammalian visual systems are hierarchical does not necessarily imply that this is the best
design for computer vision systems.

This paper provides a mathematical analysis of compositional models [6], which are a subclass of the
hierarchical models. The key idea of compositionality is to explicitly represent objects by recursive
composition from parts and subparts. This gives rise to natural learning and inference algorithms
which proceed from subparts to parts to objects (e.g., inference is efficient because a leg detector
can be used for detecting the legs of cows, horses, and yaks). The explicitness of the object
representations helps quantify the efficiency of part-sharing and makes mathematical analysis possible.
The compositional models we study are based on the work of L. Zhu and his collaborators [19, 20]
but we make several technical modifications including a parallel re-formulation of the models. We
note that in previous papers [19, 20] the representations of the compositional models were learnt in
an unsupervised manner, which relates to the memorization algorithms of Valiant [17]. This paper
does not address learning but instead explores the consequences of the representations which were
learnt.

Our analysis assumes that objects are represented by hierarchical graphical probability models³
which are composed from more elementary models by part-subpart compositions. An object (a
graphical model with $\mathcal{H}$ levels) is defined as a composition of $r$ parts which are graphical models
with $\mathcal{H}-1$ levels. These parts are defined recursively in terms of subparts which are represented by
graphical models of increasingly lower levels. It is convenient to specify these compositional models
in terms of a set of dictionaries $\{\mathcal{M}_h : h = 1, \ldots, \mathcal{H}\}$ where the level-$h$ parts in dictionary
$\mathcal{M}_h$ are composed in terms of level-$(h-1)$ parts in dictionary $\mathcal{M}_{h-1}$. The highest-level dictionary
$\mathcal{M}_{\mathcal{H}}$ represents the set of all objects. The lowest-level dictionary $\mathcal{M}_1$ represents the elementary
features that can be measured from the input image. Part-subpart composition enables us to construct a
very large number of objects by different compositions of elements from the lowest-level dictionary.
It enables us to perform part-sharing during learning and inference, which can lead to enormous
reductions in complexity, as our mathematical analysis will show.

There are three factors which enable computational efficiency. The first is part-sharing, as described 
above, which means that we only need to perform inference on the dictionary elements. The second 
is the executive-summary principle. This principle allows us to represent the state of a part coarsely
because we are also representing the state of its subparts (e.g., an executive will only want to know 
that "there is a horse in the field" and will not care about the precise positions of its legs). For 
example, consider a letter T which is composed of a horizontal and vertical bar. If the positions 
of these two bars are specified precisely, then we can specify the position of the letter T more 
crudely (sufficient for it to be "bound" to the two bars). This relates to Lee and Mumford's
high-resolution buffer hypothesis [11] and possibly to the experimental finding that neurons higher up
the visual pathway are tuned to increasingly complex image features but are decreasingly sensitive 
to spatial position. The third factor is parallelism which arises because the part dictionaries can 
be implemented in parallel, essentially having a set of receptive fields, for each dictionary element. 
This enables extremely rapid inference at the cost of a larger, but parallel, graphical model. 

Section 2 introduces the key ideas of the compositional models. Section 3 describes the inference
algorithms for serial and parallel implementations. Section 4 performs a complexity analysis and
shows potential exponential gains by using compositional models.



2 The Compositional Models 

Compositional models are based on the idea that objects are built by compositions of parts which, 
in turn, are compositions of more elementary parts. These are built by part-subpart compositions. 



3 These graphical models contain closed loops but with restricted maximal clique size. 
Figure 1: (a) Compositional part-subpart models for T and L are constructed from the same
elementary components $\tau_1, \tau_2$ (horizontal and vertical bars) using different spatial relations
$\lambda = (\mu, \sigma)$, which impose locality. The state $x$ of the parent node gives the summary position of
the object (the executive summary), while the positions $x_1, x_2$ of the components give details about
its components. (b) The hierarchical lattices. The size of the lattices decreases with scale by a
factor $q$, which helps enforce the executive summary and prevents multiple hypotheses which overlap
too much; $q = 1/4$ in this figure.



2.1 Compositional Part-Subparts 

We formulate part-subpart compositions by a probabilistic graphical model which specifies how a part
is composed of its subparts. A parent node $\nu$ of the graph represents the part by its type $\tau_\nu$ and a
state variable $x_\nu$ (e.g., $x_\nu$ could indicate the position of the part). The $r$ child nodes
$Ch(\nu) = (\nu_1, \ldots, \nu_r)$ represent the parts by their types $\tau_{\nu_1}, \ldots, \tau_{\nu_r}$ and state variables
$\vec{x}_{Ch(\nu)} = (x_{\nu_1}, \ldots, x_{\nu_r})$.⁴ The type of the parent node is specified by
$\tau_\nu = (\tau_{\nu_1}, \ldots, \tau_{\nu_r}, \lambda_\nu)$. Here $(\tau_{\nu_1}, \ldots, \tau_{\nu_r})$ are the types of the child
nodes, and $\lambda_\nu$ specifies a distribution over the states of the subparts (e.g., over their relative spatial
positions). Hence the type $\tau_\nu$ of the parent specifies the part-subpart compositional model.

The probability distribution for the part-subpart model relates the states of the part and the subparts
by:

$$P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu) = \delta(x_\nu - f(\vec{x}_{Ch(\nu)})) \, h(\vec{x}_{Ch(\nu)}; \lambda_\nu). \qquad (1)$$

Here $f(\cdot)$ is a deterministic function, so the state of the parent node is determined uniquely by
the state of the child nodes. The function $h(\cdot)$ specifies a distribution on the relative states of
the child nodes. The distribution $P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu)$ obeys a locality principle, which means that
$P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu) = 0$ unless $|x_{\nu_i} - x_\nu|$ is smaller than a threshold for all $i = 1, \ldots, r$.
This requirement captures the intuition that subparts of a part are typically close together.

The state variable $x_\nu$ of the parent node provides an executive summary description of the part.
Hence it is restricted to take a smaller set of values than the state variables $x_{\nu_i}$ of the subparts.
Intuitively, the state $x_\nu$ of the parent offers summary information (e.g., there is a cow in the right
side of a field) while the child states $\vec{x}_{Ch(\nu)}$ offer more detailed information (e.g., the position of
the parts of the cow). In general, information about the object is represented in a distributed manner
with coarse information at the upper levels of the hierarchy and more precise information at lower
levels.



We give examples of part-subpart compositions in Figure 1(a). The compositions represent the
letters T and L, which are the types of the parent nodes. The types of the child nodes are horizontal
and vertical bars, indicated by $\tau_1 = H$, $\tau_2 = V$. The child state variables $x_1, x_2$ indicate the image
positions of the horizontal and vertical bars. The state variable $x$ of the parent node gives a summary
description of the position of the letters T and L. The compositional models for letters T and L differ
by their $\lambda$ parameter, which specifies the relative positions of the horizontal and vertical bars. In this
example, we choose $h(\cdot; \lambda)$ to be a Gaussian distribution, so $\lambda = (\mu, \sigma)$ where $\mu$ is the mean relative
position between the bars and $\sigma$ is the covariance. We set $f(x_1, x_2) = (1/2)(x_1 + x_2)$, so the
state of the parent node specifies the average position of the child nodes (i.e., the positions of the
two bars). Hence the two compositional models for the T and L have types $\tau_T = (H, V, \lambda_T)$ and
$\tau_L = (H, V, \lambda_L)$.
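The T and L example can be made concrete with a minimal sketch. It assumes 2-D bar positions, an isotropic Gaussian with a scalar standard deviation in place of the full covariance, and made-up offsets `LAMBDA_T` and `LAMBDA_L`; none of these values come from the paper.

```python
import math

def parent_state(x1, x2):
    """f(x1, x2) = (x1 + x2) / 2: the parent summarizes the two bar positions."""
    return ((x1[0] + x2[0]) / 2.0, (x1[1] + x2[1]) / 2.0)

def log_h(x1, x2, mu, sigma):
    """Log density of an isotropic 2-D Gaussian on the relative position x2 - x1.

    mu is the mean offset of the vertical bar from the horizontal bar and
    sigma is a scalar standard deviation (an assumed simplification).
    """
    dx = x2[0] - x1[0] - mu[0]
    dy = x2[1] - x1[1] - mu[1]
    return -(dx * dx + dy * dy) / (2.0 * sigma ** 2) - math.log(2.0 * math.pi * sigma ** 2)

# Hypothetical lambda = (mu, sigma) parameters, with the y axis pointing up.
LAMBDA_T = ((0.0, -2.0), 0.5)   # vertical bar centred below the horizontal bar
LAMBDA_L = ((-2.0, 2.0), 0.5)   # vertical bar up and to the left of the horizontal bar

def best_type(x_horizontal, x_vertical):
    """Pick the letter whose spatial relation explains the two bars better."""
    scores = {
        "T": log_h(x_horizontal, x_vertical, *LAMBDA_T),
        "L": log_h(x_horizontal, x_vertical, *LAMBDA_L),
    }
    return max(scores, key=scores.get)
```

Both letters share the same child types and the same summary function `parent_state`; only the $\lambda$ parameter distinguishes them, which is what makes the bar detectors reusable.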



4 In this paper, we assume a fixed value r for all part-subpart compositions.
Figure 2: Left Panel: Two part-subpart models. Center Panel: Combining two part-subpart models
by composition to make a higher level model. Right Panel: Some examples of the shapes that can
be generated by different parameter settings $\lambda$ of the distribution.



2.2 Models of Object Categories 

An object category can be modeled by repeated part-subpart compositions. This is illustrated in
Figure 2, where we combine T's and L's with other parts to form more complex objects. More
generally, we can combine part-subpart compositions into bigger structures by treating the parts as
subparts of higher order parts.

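More formally, an object category of type $\tau_{\mathcal{H}}$ is represented by a probability distribution defined over
a graph $\mathcal{V}$. This graph has a hierarchical structure with levels $h \in \{0, \ldots, \mathcal{H}\}$, where
$\mathcal{V} = \bigcup_{h=0}^{\mathcal{H}} \mathcal{V}_h$. Each object has a single root node at level $\mathcal{H}$ (i.e., $\mathcal{V}_{\mathcal{H}}$ contains a single
node). Any node $\nu \in \mathcal{V}_h$ (for $h > 0$) has $r$ child nodes $Ch(\nu)$ in $\mathcal{V}_{h-1}$ indexed by
$(\nu_1, \ldots, \nu_r)$. Hence there are $r^{\mathcal{H}-h}$ nodes at level $h$ (i.e., $|\mathcal{V}_h| = r^{\mathcal{H}-h}$).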

At each node $\nu$ there is a state variable $x_\nu$ which indicates spatial position and a type $\tau_\nu$. The type
$\tau_{\mathcal{H}}$ of the root node indicates the object category and also specifies the types of its parts.

The position variables $x_\nu$ take values in a set of lattices $\{\mathcal{D}_h : h = 0, \ldots, \mathcal{H}\}$, so that a level-$h$
node, $\nu \in \mathcal{V}_h$, takes position $x_\nu \in \mathcal{D}_h$. The leaf nodes $\mathcal{V}_0$ of the graph take values on the image
lattice $\mathcal{D}_0$. The lattices are evenly spaced and the number of lattice points decreases by a factor of
$q < 1$ for each level, so $|\mathcal{D}_h| = q^h |\mathcal{D}_0|$, see Figure 1(b). This decrease in the number of lattice points
imposes the executive summary principle. The lattice spacing is designed so that parts do not overlap. At
higher levels of the hierarchy the parts cover larger regions of the image and so the lattice spacing
must be larger, and hence the number of lattice points smaller, to prevent overlapping.⁵
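To get a feel for these sizes, the following sketch tabulates the number of lattice points $|\mathcal{D}_h| = q^h |\mathcal{D}_0|$ and the number of nodes per object $|\mathcal{V}_h| = r^{\mathcal{H}-h}$ at each level. The function name and the numbers in the usage line are illustrative assumptions.

```python
def hierarchy_sizes(d0, q, r, H):
    """Return [(h, |D_h|, |V_h|)] for h = 0..H.

    d0: number of points on the image lattice D_0
    q:  lattice shrink factor per level (q < 1)
    r:  number of subparts per part
    H:  number of levels above the leaves
    """
    return [(h, d0 * q ** h, r ** (H - h)) for h in range(H + 1)]

# Illustrative values only: a 32 x 32 image lattice, q = 1/4, r = 2, H = 3.
for level, lattice_points, nodes in hierarchy_sizes(1024, 0.25, 2, 3):
    print(level, lattice_points, nodes)
```

Going up the hierarchy, both the number of candidate positions and the number of nodes shrink geometrically, which is the quantitative content of the executive summary principle.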

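The probability model for an object category of type $\tau_{\mathcal{H}}$ is specified by products of part-subpart
relations:

$$P(\vec{x} \mid \tau_{\mathcal{H}}) = \prod_{\nu \in \mathcal{V}/\mathcal{V}_0} P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu) \, U(x_{\mathcal{H}}). \qquad (2)$$

Here $U(x_{\mathcal{H}})$ is the uniform distribution.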



2.3 Multiple Object Categories, Shared Parts, and Hierarchical Dictionaries 

Now suppose we have a set of object categories, each with a type $\tau_{\mathcal{H}}$ and each of which can be
expressed by an equation such as equation (2). We assume that these objects share parts. To quantify the
amount of part sharing we define a hierarchical dictionary $\{\mathcal{M}_h : h = 0, \ldots, \mathcal{H}\}$, where $\mathcal{M}_h$ is the
dictionary of parts at level $h$. This gives an exhaustive set of the parts of this set of objects, at all
levels $h = 0, \ldots, \mathcal{H}$. The elements of the dictionary $\mathcal{M}_h$ are composed from elements of the dictionary
$\mathcal{M}_{h-1}$ by part-subpart compositions.⁶

This gives an alternative way to think of object models. The type variable $\tau_\nu$ of a node at level $h$
(i.e., in $\mathcal{V}_h$) indexes an element of the dictionary $\mathcal{M}_h$. Hence objects can be encoded in terms of
the hierarchical dictionary. Moreover, we can create new objects by making new compositions from
existing elements of the dictionaries.



5 Previous work [20, 19] was not formulated on lattices and used non-maximal suppression to achieve the
same effect.

6 The unsupervised learning algorithm in [19] automatically generates this hierarchical dictionary.
2.4 The Likelihood Function and the Generative Model



To specify a generative model for each object category we proceed as follows. The prior specifies a 
distribution over the positions and types of the leaf nodes of the object model. Then the likelihood 
function is specified in terms of the type at the leaf nodes (e.g., if the leaf node is a vertical bar, then 
there is a high probability that the image has a vertical edge at that position). 

More formally, the prior $P(\vec{x} \mid \tau_{\mathcal{H}})$, see equation (2), specifies a distribution over a set of points
$\mathcal{L} = \{x_\nu : \nu \in \mathcal{V}_0\}$ (the leaf nodes of the graph) and specifies their types $\{\tau_\nu : \nu \in \mathcal{V}_0\}$. These
points are required to lie on the image lattice (i.e., $x_\nu \in \mathcal{D}_0$). We denote this as
$\{(x, \tau(x)) : x \in \mathcal{L}\}$ where $\tau(x)$ is specified in the natural manner (i.e., if $x = x_\nu$ then
$\tau(x) = \tau_\nu$). We specify distributions $P(I(x) \mid \tau(x))$ for the probability of the image $I(x)$ at $x$
conditioned on the type of the leaf node. We specify a default probability $P(I(x) \mid \tau_0)$ at positions
$x$ where there is no leaf node of the object.

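This gives a likelihood function for the states $\vec{x} = \{x_\nu : \nu \in \mathcal{V}\}$ of the object model in terms of the
image $\mathbf{I} = \{I(x) : x \in \mathcal{D}_0\}$:

$$P(\mathbf{I} \mid \vec{x}) = \prod_{x \in \mathcal{L}} P(I(x) \mid \tau(x)) \times \prod_{x \in \mathcal{D}_0/\mathcal{L}} P(I(x) \mid \tau_0). \qquad (3)$$

The likelihood and the prior, equations (3) and (2), give a generative model for each object category.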



We can extend this in the natural manner to give generative models for two, or more, objects in the
image provided they do not overlap. Intuitively, this involves multiple sampling from the prior to
determine the types of the lattice pixels, followed by sampling from $P(I(x) \mid \tau)$ at the leaf nodes to
determine the image $\mathbf{I}$. Similarly, we have a default background model for the entire image if no
object is present:

$$P_B(\mathbf{I}) = \prod_{x \in \mathcal{D}_0} P(I(x) \mid \tau_0). \qquad (4)$$
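Equations (3) and (4) differ only at the leaf positions, so their log ratio reduces to a sum over the leaves. The sketch below checks this numerically on a toy 1-D image. The emission tables in `P`, the pixel labels, and the function names are hypothetical and only serve the illustration.

```python
import math

# Hypothetical emission tables P(I(x) | tau): probability of a pixel label
# given the type at that position. "bg" plays the role of the default type tau_0.
P = {
    "H":  {"h": 0.8, "v": 0.1, ".": 0.1},
    "V":  {"h": 0.1, "v": 0.8, ".": 0.1},
    "bg": {"h": 0.1, "v": 0.1, ".": 0.8},
}

def log_likelihood(image, leaves):
    """log P(I | x) as in equation (3): leaf types on L, default type elsewhere."""
    return sum(math.log(P[leaves.get(pos, "bg")][pix]) for pos, pix in image.items())

def log_background(image):
    """log P_B(I) as in equation (4): default type at every position."""
    return sum(math.log(P["bg"][pix]) for pix in image.values())

def log_ratio(image, leaves):
    """Sum over leaf positions only of log P(I(x) | tau(x)) / P(I(x) | tau_0)."""
    return sum(math.log(P[t][image[pos]] / P["bg"][image[pos]])
               for pos, t in leaves.items())

image = {0: ".", 1: "h", 2: "v", 3: "."}   # position -> observed pixel label
leaves = {1: "H", 2: "V"}                  # leaf position -> leaf type
gap = log_likelihood(image, leaves) - log_background(image) - log_ratio(image, leaves)
```

Here `gap` is zero up to floating-point rounding: the background terms at non-leaf positions cancel, so evidence for an object only needs to be accumulated where its leaves sit.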

3 Inference by Dynamic Programming 

The inference task is to determine which objects are present in the image and to specify their
positions. This involves two subtasks: (i) state estimation, to determine the optimal states of a model
and hence the position of the object and its parts, and (ii) model selection, to determine whether
objects are present or not. As we will show, both tasks can be reduced to calculating and comparing
log-likelihood ratios which can be performed efficiently using dynamic programming methods.

We will first describe the simplest case which consists of estimating the state variables of a single 
object model and using model selection to determine whether the object is present in the image and, 
if so, how many times. Next we show that we can perform inference and model selection for multiple 
objects efficiently by exploiting part sharing (using hierarchical dictionaries). Finally, we show how 
these inference tasks can be performed even more efficiently using a parallel implementation. We 
stress that we are performing exact inference and no approximations are made. We are simply
exploiting part-sharing so that computations required for performing inference for one object can be
re-used when performing inference for other objects.

3.1 Inference Tasks: State Detection and Model Selection 

We first describe a standard dynamic programming algorithm for finding the optimal state of a single
object category model. Then we describe how the same computations can be used to perform model
selection, and to perform detection and state estimation when the object appears multiple times in the
image (without overlapping).

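Consider performing inference for a single object category model defined by equations (2) and (3). To
calculate the MAP estimate of the state variables requires computing
$\vec{x}^* = \arg\max_{\vec{x}} \{\log P(\mathbf{I} \mid \vec{x}) + \log P(\vec{x}; \tau_{\mathcal{H}})\}$. By subtracting the constant term
$\log P_B(\mathbf{I})$ from the right-hand side, we can re-express this as estimating:

$$\vec{x}^* = \arg\max_{\vec{x}} \Big\{ \sum_{x \in \mathcal{L}} \log \frac{P(I(x) \mid \tau(x))}{P(I(x) \mid \tau_0)} + \sum_{\nu \in \mathcal{V}/\mathcal{V}_0} \log P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu) + \log U(x_{\mathcal{H}}) \Big\}. \qquad (5)$$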
Figure 3: Left Panel: The feedforward pass propagates hypotheses up to the highest level where
the best state is selected. Center Panel: Feedback propagates information from the top node,
disambiguating the middle level nodes. Right Panel: Feedback from the middle level nodes propagates
back to the input layer to resolve ambiguities there. This algorithm estimates the top-level
executive summary description in a rapid feed-forward pass. The top-down pass is required to allow
high-level context to eliminate false hypotheses at the lower levels: "high-level tells low-level to
stop gossiping".



Here $\mathcal{L}$ denotes the positions of the leaf nodes of the graph, which must be determined during
inference.

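We estimate $\vec{x}^*$ by performing dynamic programming. This involves a bottom-up pass which
recursively computes quantities
$\phi(x_h, \tau_h) = \max_{\vec{x}/x_h} \{\log \frac{P(\mathbf{I} \mid \vec{x})}{P_B(\mathbf{I})} + \log P(\vec{x}; \tau_h)\}$ by the formula:

$$\phi(x_h, \tau_h) = \max_{\vec{x}_{Ch(\nu)}} \Big\{ \sum_{i=1}^{r} \phi(x_{\nu_i}, \tau_{\nu_i}) + \log P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu) \Big\}, \qquad (6)$$

where $\nu$ is a level-$h$ node with state $x_\nu = x_h$ and type $\tau_\nu = \tau_h$.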

We refer to $\phi(x_h, \tau_h)$ as the local evidence for part $\tau_h$ with state $x_h$ (after maximizing over the
states of the lower parts of the graphical model). This local evidence is computed bottom-up. We call this
the local evidence because it ignores the context evidence for the part which will be provided during
top-down processing (i.e., that evidence for other parts of the object, in consistent positions, will
strengthen the evidence for this part).

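The bottom-up pass outputs the global evidence $\phi(x_{\mathcal{H}}, \tau_{\mathcal{H}})$ for object category $\tau_{\mathcal{H}}$ at position
$x_{\mathcal{H}}$. We can detect the most probable state of the object by computing
$x^*_{\mathcal{H}} = \arg\max_{x_{\mathcal{H}}} \phi(x_{\mathcal{H}}, \tau_{\mathcal{H}})$. Then we can perform the top-down pass of dynamic programming to
estimate the most probable states $\vec{x}^*$ of the entire model by recursively performing:

$$\vec{x}^*_{Ch(\nu)} = \arg\max_{\vec{x}_{Ch(\nu)}} \Big\{ \sum_{i=1}^{r} \phi(x_{\nu_i}, \tau_{\nu_i}) + \log P(\vec{x}_{Ch(\nu)} \mid x^*_\nu; \tau_\nu) \Big\}. \qquad (7)$$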

This outputs the most probable state of the object in the image. Note that the bottom-up process first
estimates the optimal "executive summary" description of the object ($x_{\mathcal{H}}$) and only later determines
the optimal estimates of the lower-level states of the object in the top-down pass. Hence, the
algorithm is faster at detecting that there is a cow in the right side of the field (estimated in the
bottom-up pass) and is slower at determining the position of the feet of the cow (estimated in the
top-down pass). This is illustrated in Figure 3.
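The two passes of equations (6) and (7) can be sketched on a toy 1-D problem. The sketch assumes $r = 2$, a summary function that halves the lattice at each level, and an unnormalized log-compatibility `-abs(c2 - c1 - offset)` standing in for $\log P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu)$; the part names `P`, `Q`, `O` and all numbers are hypothetical.

```python
import itertools

def bottom_up(phi_1, phi_2, offset):
    """Equation (6) for one part with two children on a 1-D lattice.

    phi_1, phi_2: local evidence of the two child types, indexed by position.
    Returns (phi, best): phi[p] is the parent's local evidence at coarse
    position p, and best[p] the maximizing child configuration, which is
    stored for the top-down pass of equation (7).
    """
    n_child = len(phi_1)
    n_parent = (2 * (n_child - 1)) // 4 + 1
    phi = [float("-inf")] * n_parent
    best = [None] * n_parent
    for c1, c2 in itertools.product(range(n_child), repeat=2):
        p = (c1 + c2) // 4   # f: mean child position, mapped to the coarser lattice
        score = phi_1[c1] + phi_2[c2] - abs(c2 - c1 - offset)
        if score > phi[p]:
            phi[p], best[p] = score, (c1, c2)
    return phi, best

# Leaf evidence (log-likelihood ratios) for two feature types on 8 pixels.
phi_a = [2, -1, -1, -1, -1, 2, -1, -1]
phi_b = [-1, 2, -1, -1, 2, -1, -1, -1]

# Bottom-up pass: level-1 parts P = (a, b) and Q = (b, a), level-2 object O = (P, Q).
phi_P, best_P = bottom_up(phi_a, phi_b, offset=1)
phi_Q, best_Q = bottom_up(phi_b, phi_a, offset=1)
phi_O, best_O = bottom_up(phi_P, phi_Q, offset=2)

# Top-down pass: select the root state, then read off the stored child states.
root = max(range(len(phi_O)), key=phi_O.__getitem__)
p_pos, q_pos = best_O[root]
leaves = {"P": best_P[p_pos], "Q": best_Q[q_pos]}
```

The coarse answer (`root`) is available as soon as the bottom-up pass ends; the precise leaf positions in `leaves` only appear after the top-down lookups, mirroring the cow example above.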

Importantly, we only need to perform slight extensions of this algorithm to compute significantly
more. First, we can perform model selection (to determine if the object is present in the image) by
determining if $\phi(x_{\mathcal{H}}, \tau_{\mathcal{H}}) > T$, where $T$ is a threshold. This is because, by equation (5),
$\phi(x_{\mathcal{H}}, \tau_{\mathcal{H}})$ is the log-likelihood ratio of the probability that the object is present at position
$x_{\mathcal{H}}$ compared to the probability that the corresponding part of the image is generated by the
background image model $P_B(\cdot)$. Secondly, we can compute the probability that the object occurs several
times in the image, by computing the set $\{x_{\mathcal{H}} : \phi(x_{\mathcal{H}}, \tau_{\mathcal{H}}) > T\}$, to compute the "executive
summary" descriptions for each object (e.g., the coarse positions of each object). We then perform the
top-down pass initialized at each coarse position (i.e., at each point of the set described above) to
determine the optimal configuration for the states of the objects. Hence, we can reuse the computations
required to detect a single object in order to detect multiple instances of the object (provided there
are no overlaps).⁷ The number of objects in the image is determined by the log-likelihood ratio test
with respect to the background model.

Figure 4: Sharing. Left Panel: Two Level-2 models A and B which share Level-1 model b as a
subpart. Center Panel: Level-1 model b. Inference computation only requires us to do inference
over model b once, and then it can be used to compute the optimal states for models A and B.
Right Panel: Note that if we combine models A and B by a root OR node then we obtain a graphical
model for both objects. This model has a closed loop which would seem to make inference more
challenging. But by exploiting the shared part we can do inference optimally despite the closed loop.
Inference can be done on the dictionaries, far right.



3.2 Inference on Multiple Objects by Part Sharing using the Hierarchical Dictionaries 

Now suppose we want to detect instances of many object categories $\tau_{\mathcal{H}} \in \mathcal{M}_{\mathcal{H}}$ simultaneously. We
can exploit the shared parts by performing inference using the hierarchical dictionaries.

The main idea is that we need to compute the global evidence $\phi(x_{\mathcal{H}}, \tau_{\mathcal{H}})$ for all objects
$\tau_{\mathcal{H}} \in \mathcal{M}_{\mathcal{H}}$ and at all positions $x_{\mathcal{H}}$ in the top-level lattice. These quantities could be computed
separately for each object by performing the bottom-up pass, specified by equation (6), for each object.
But this is wasteful because the objects share parts and so we would be performing the same computations
multiple times. Instead we can perform all the necessary computations more efficiently by working
directly with the hierarchical dictionaries.

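More formally, computing the global evidence for all object models and at all positions is specified
as follows.

$$\text{Let } \mathcal{D}^*_{\mathcal{H}} = \{x_{\mathcal{H}} \in \mathcal{D}_{\mathcal{H}} \text{ s.t. } \max_{\tau_{\mathcal{H}} \in \mathcal{M}_{\mathcal{H}}} \phi(x_{\mathcal{H}}, \tau_{\mathcal{H}}) > T_{\mathcal{H}}\},$$

$$\text{For } x_{\mathcal{H}} \in \mathcal{D}_{\mathcal{H}}, \text{ let } \tau^*_{\mathcal{H}}(x_{\mathcal{H}}) = \arg\max_{\tau_{\mathcal{H}} \in \mathcal{M}_{\mathcal{H}}} \phi(x_{\mathcal{H}}, \tau_{\mathcal{H}}),$$

$$\text{Detect } \vec{x}^*/x_{\mathcal{H}} = \arg\max_{\vec{x}/x_{\mathcal{H}}} \Big\{ \log \frac{P(\mathbf{I} \mid \vec{x})}{P_B(\mathbf{I})} + \log P(\vec{x}; \tau^*_{\mathcal{H}}(x_{\mathcal{H}})) \Big\} \text{ for all } x_{\mathcal{H}} \in \mathcal{D}^*_{\mathcal{H}}. \qquad (8)$$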

All these calculations can be done efficiently using the hierarchical dictionaries (except for the max
and arg max tasks at level $\mathcal{H}$, which must be done separately). Recall that each dictionary element
at level $h$ is composed, by part-subpart composition, of dictionary elements at level $h-1$. Hence
we can apply the bottom-up update rule in equation (6) directly to the dictionary elements. This is
illustrated in Figure 4. As analyzed in the next section, this can yield major gains in computational
complexity.

Once the global evidences for each object model have been computed at each position (in the top
lattice) we can perform winner-take-all to estimate the object model which has largest evidence
at each position. Then we can apply thresholding to see if it passes the log-likelihood ratio test
compared to the background model. If it does pass this log-likelihood test, then we can use the
top-down pass of dynamic programming, see equation (7), to estimate the most probable state of all
parts of the object model.
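The saving from working on the dictionaries can be illustrated by counting how many times the bottom-up rule has to be applied. The sketch below uses a made-up dictionary in the spirit of Figure 4 (objects `A` and `B` share the part `b`); each "evaluation" stands for applying equation (6) to one part at all positions of its lattice. All names are hypothetical.

```python
# Hypothetical hierarchical dictionary: each part lists its r = 2 subparts;
# level-0 parts (image features) have no subparts.
DICTIONARY = {
    "h": (), "v": (),
    "b": ("h", "v"), "c": ("v", "h"),
    "A": ("b", "c"), "B": ("b", "b"),
}

def count_without_sharing(part):
    """Part evaluations needed if an object re-evaluates its whole tree."""
    return 1 + sum(count_without_sharing(child) for child in DICTIONARY[part])

def evaluate_with_sharing(objects):
    """Evaluate every dictionary element once, children before parents.

    Returns the evaluation order; its length is the number of evaluations.
    """
    done = []

    def visit(part):
        if part in done:
            return               # local evidence already computed, reuse it
        for child in DICTIONARY[part]:
            visit(child)
        done.append(part)

    for obj in objects:
        visit(obj)
    return done

separate = sum(count_without_sharing(obj) for obj in ("A", "B"))
shared = evaluate_with_sharing(("A", "B"))
```

With these two objects, `separate` is 14 while `shared` has 6 entries, one per dictionary element. The gap widens as more objects are built from the same lower-level parts.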

7 Note that this is equivalent to performing optimal inference simultaneously over a set of different generative
models of the image, where one model assumes that there is one instance of the object in the image, another
model assumes there are two, and so on.
We note that we are performing exact inference over multiple object models at the same time. This
is perhaps unintuitive to some readers because this corresponds to doing exact inference over a
probability model which can be expressed as a graph with a large number of closed loops, see
Figure 4. But the main point is that part-sharing enables us to share inference efficiently between
many models.

The only computations which cannot be performed by dynamic programming are the max and
arg max tasks at level $\mathcal{H}$, see the top lines of equation (8). These are simple operations and require
order $|\mathcal{M}_{\mathcal{H}}| \times |\mathcal{D}_{\mathcal{H}}|$ calculations. This will usually be a small number, compared to the complexity
of other computations. But this will become very large if there are a large number of objects, as we
will discuss in the complexity analysis.

3.3 Parallel Formulation and Inference Algorithm 

Finally, we observe that all the computations required for performing inference on multiple objects 
can be parallelized. This requires computing the quantities in equation ([8]). 




Figure 5: Parallel Hierarchical Implementation. Far Left Panel: Four Level-0 models are spaced
densely in the image (here an 8 × 2 grid). Left Panel: the four Level-1 models are sampled at a lower
rate, and each has 4 × 1 copies. Right Panel: the four Level-2 models are sampled less frequently.
Far Right Panel: a bird's eye view of the parallel hierarchy. The dots represent a "column" of
four Level-0 models. The crosses represent columns containing four Level-1 models. The triangles
represent a column of the Level-2 models.

Parallelization is possible because, in the bottom-up pass of dynamic programming, calculations are
done separately for each position $x$, see equation (6). So we can compute the local evidence for all
parts in the hierarchical dictionary recursively and in parallel for each position, and hence compute
the $\phi(x_{\mathcal{H}}, \tau_{\mathcal{H}})$ for all $x_{\mathcal{H}} \in \mathcal{D}_{\mathcal{H}}$ and $\tau_{\mathcal{H}} \in \mathcal{M}_{\mathcal{H}}$. The max and arg max operations at level
$\mathcal{H}$ can also be done in parallel for each position $x_{\mathcal{H}} \in \mathcal{D}_{\mathcal{H}}$. Similarly we can perform the top-down
pass of dynamic programming, see equation (7), in parallel to compute the best configurations of the
detected objects for different possible positions of the objects (on the top-level lattice).

The parallel formulation can be visualized by making copies of the elements of the hierarchical
dictionary (the parts), so that a model at level $h$ has $|\mathcal{D}_h|$ copies, with one copy at each
lattice point. Hence at level $h$, we have $m_h = |\mathcal{M}_h|$ "receptive fields" at each lattice point in
$\mathcal{D}_h$ with each one tuned to a different part $\tau_h \in \mathcal{M}_h$, see Figure 5. At level 0, these receptive
fields are tuned to specific image properties (e.g., horizontal or vertical bars). Note that the receptive
fields are highly non-linear (i.e., they do not obey any superposition principle).⁸ Moreover, they are
influenced both by bottom-up processing (during the bottom-up pass) and by top-down processing (during
the top-down pass). The bottom-up processing computes the local evidence while the top-down pass
modifies it by the high-level context.

The computations required by this parallel implementation are illustrated in Figure 6. The
bottom-up pass is performed by a two-layer network where the first layer performs an AND operation (to
compute the local evidence for a specific configuration of the child nodes) and the second layer
performs an OR, or max operation, to determine the local evidence (by max-ing over the possible
child configurations).⁹ The top-down pass only has to perform an arg max computation to determine
which child configuration gave the best local evidence.

Figure 6: Parallel implementation of Dynamic Programming. The left part of the figure shows
the bottom-up pass of dynamic programming. The local evidence for the parent node is obtained by
taking the maximum of the scores of the $C_r$ possible states of the child nodes. This can be computed
by a two-layer network where the first level computes the scores for all $C_r$ child node states, which
can be done in parallel, and the second level computes the maximum score. This is like an AND
operation followed by an OR. The top-down pass requires the parent node to select which of the $C_r$
child configurations gave the maximum score, and to suppress the other configurations.

8 Nevertheless they are, broadly speaking, tuned to image stimuli which have the mean shape of the
corresponding part $\tau_h$. In agreement with findings about mammalian cortex, the receptive fields become more
sensitive to image structure (e.g., from bars, to more complex shapes) at increasing levels. Moreover, their
sensitivity to spatial position decreases because at higher levels the models only encode the executive summary
descriptions, on coarser lattices, while the finer details of the object are represented more precisely at the lower
levels.
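The two-layer AND/OR computation for a single parent unit can be sketched as follows. The data layout (one dictionary of evidence per child, one log-compatibility per child configuration) and all numbers are assumptions made for the illustration, not the paper's implementation.

```python
def and_layer(child_evidence, log_compat):
    """First layer: one unit per child configuration.

    Each unit adds the local evidence of its children to the log-probability
    of that configuration (the bracketed term of equation (6)). The units are
    independent of each other, so they could be evaluated in parallel.
    """
    return {
        cfg: sum(ev[state] for ev, state in zip(child_evidence, cfg)) + log_compat[cfg]
        for cfg in log_compat
    }

def or_layer(scores):
    """Second layer: the max over configurations gives the parent's local evidence."""
    return max(scores.values())

def top_down_select(scores):
    """Top-down pass: keep the configuration that produced the max (equation (7))."""
    return max(scores, key=scores.get)

# Two children, each with two possible states; C_r = 4 configurations.
child_evidence = [{0: 1.0, 1: -1.0}, {0: -0.5, 1: 2.0}]
log_compat = {(0, 0): -1.0, (0, 1): 0.0, (1, 0): -2.0, (1, 1): -1.0}

scores = and_layer(child_evidence, log_compat)
parent_evidence = or_layer(scores)
winning_config = top_down_select(scores)
```

For these numbers the parent's local evidence is 3.0, obtained from configuration `(0, 1)`; the other three AND units are the ones suppressed in the top-down pass.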

4 Complexity Analysis 

We now analyze the complexity of the inference algorithms for performing the tasks. Firstly, we
analyze complexity for a single object (without part-sharing). Secondly, we study the complexity
for multiple objects with shared parts. Thirdly, we consider the complexity of the parallel
implementation.

The complexity is expressed in terms of the following quantities: (I) The size $|\mathcal{D}_0|$ of the image.
(II) The scale decrease factor $q$ (enabling executive summary). (III) The number $\mathcal{H}$ of levels of the
hierarchy. (IV) The sizes $\{|\mathcal{M}_h| : h = 1, \ldots, \mathcal{H}\}$ of the hierarchical dictionaries. (V) The number $r$
of subparts of each part. (VI) The number $C_r$ of possible part-subpart configurations.

4.1 Complexity for Single Objects and Ignoring Part Sharing 

This section estimates the complexity of inference $N_{so}$ for a single object and the complexity $N_{mo}$
for multiple objects when part sharing is not used. These results are for comparison to the
complexities derived in the following section using part sharing.

The inference complexity for a single object requires computing: (i) the number $N_{bu}$ of
computations required by the bottom-up pass, (ii) the number $N_{ms}$ of computations required by model
selection at the top level of the hierarchy, and (iii) the number $N_{td}$ of computations required by the
top-down pass.

The complexity $N_{bu}$ of the bottom-up pass can be computed from equation (6). This requires a total
of $C_r$ computations for each position $x_\nu$ for each level-$h$ node. There are $r^{H-h}$ nodes at level $h$ and
each can take $|\mathcal{D}_0| q^h$ positions. This gives a total of $|\mathcal{D}_0| C_r q^h r^{H-h}$ computations at level $h$. This
can be summed over all levels to yield:

$$N_{bu} = \sum_{h=1}^{H} |\mathcal{D}_0| C_r r^{H} (q/r)^h = |\mathcal{D}_0| C_r r^{H} \sum_{h=1}^{H} (q/r)^h = |\mathcal{D}_0| C_r r^{H} \frac{q/r}{1 - q/r} \{1 - (q/r)^H\}. \tag{9}$$

Observe that the main contributions to $N_{bu}$ come from the first few levels of the hierarchy because
the factors $(q/r)^h$ decrease rapidly with $h$. This calculation uses $\sum_{h=1}^{H} x^h = \frac{x(1 - x^H)}{1 - x}$. For large $H$
we can approximate $N_{bu}$ by $|\mathcal{D}_0| C_r \frac{q r^{H-1}}{1 - q/r}$ (because $(q/r)^H$ will be small).
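
As a numerical check of the bottom-up count, the sketch below compares the level-by-level sum $\sum_h |\mathcal{D}_0| C_r q^h r^{H-h}$ with its geometric-series closed form and with the large-$H$ approximation. The parameter values and function names are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def bottom_up_cost_per_level(d0: int, c_r: int, q: float, r: int, num_levels: int) -> List[float]:
    """Cost at each level h = 1..H: |D_0| * C_r * q^h * r^(H-h)."""
    return [d0 * c_r * q**h * r ** (num_levels - h) for h in range(1, num_levels + 1)]


def bottom_up_cost_closed_form(d0: int, c_r: int, q: float, r: int, num_levels: int) -> float:
    """Closed form via the geometric series sum_{h=1}^{H} x^h = x (1 - x^H) / (1 - x), x = q/r."""
    x = q / r
    return d0 * c_r * r**num_levels * (x / (1 - x)) * (1 - x**num_levels)


def bottom_up_cost_large_h(d0: int, c_r: int, q: float, r: int, num_levels: int) -> float:
    """Large-H approximation: drop the (q/r)^H correction term."""
    return d0 * c_r * q * r ** (num_levels - 1) / (1 - q / r)


# Hypothetical parameters: 64x64 image lattice, C_r = 9, q = 1/4, r = 3, H = 4.
D0, C_R, Q, R, LEVELS = 4096, 9, 0.25, 3, 4
per_level = bottom_up_cost_per_level(D0, C_R, Q, R, LEVELS)
exact = sum(per_level)
closed = bottom_up_cost_closed_form(D0, C_R, Q, R, LEVELS)
approx = bottom_up_cost_large_h(D0, C_R, Q, R, LEVELS)
first_level_share = per_level[0] / exact  # fraction of the work done at level 1
```

With these values the first level accounts for more than 90% of the total, which illustrates why the lowest levels dominate the cost.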



9 Note that other hierarchical models, including bio-inspired ones, use similar operations but are
motivated by different reasons.
We calculate $N_{ms} = q^{H} |\mathcal{D}_0|$ for the complexity of model selection (which only requires
thresholding at every point on the top-level lattice).

The complexity $N_{td}$ of the top-down pass is computed from equation (7). At each level there
are $r^{H-h}$ nodes and we must perform $C_r$ computations for each. This yields complexity of
$\sum_{h=1}^{H} C_r r^{H-h}$ for each possible root node. There are at most $q^{H} |\mathcal{D}_0|$ possible root nodes
(depending on the results of the model selection stage). This yields an upper bound:

$$N_{td} \leq |\mathcal{D}_0| C_r q^{H} \frac{r^{H-1}}{1 - 1/r} \left\{1 - \frac{1}{r^{H}}\right\}. \tag{10}$$
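
The closed form in this bound follows from the same geometric-series identity used for the bottom-up pass. A short derivation, assuming only the per-root cost $\sum_{h=1}^{H} C_r r^{H-h}$ and the limit of $q^{H} |\mathcal{D}_0|$ root nodes stated in the text:

```latex
\sum_{h=1}^{H} r^{H-h} = \sum_{k=0}^{H-1} r^{k} = \frac{r^{H} - 1}{r - 1}
  = \frac{r^{H-1}}{1 - 1/r}\left\{1 - \frac{1}{r^{H}}\right\},
\qquad\text{so}\qquad
N_{td} \le q^{H} |\mathcal{D}_0| \, C_r \, \frac{r^{H-1}}{1 - 1/r}\left\{1 - \frac{1}{r^{H}}\right\}.
```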

Clearly the complexity is dominated by the complexity $N_{bu}$ of the bottom-up pass. For simplicity,
we will bound/approximate this by:

$$N_{so} = |\mathcal{D}_0| C_r \frac{q r^{H-1}}{1 - q/r}. \tag{11}$$

Now suppose we perform inference for multiple objects simultaneously without exploiting shared
parts. In this case the complexity will scale linearly with the number $|\mathcal{M}_H|$ of objects. This gives
us complexity:

$$N_{mo} = |\mathcal{M}_H| |\mathcal{D}_0| C_r \frac{q r^{H-1}}{1 - q/r}. \tag{12}$$

4.2 Computation with Shared Parts in Series and in Parallel 

This section computes the complexity using part sharing, firstly for the standard serial
implementation of part sharing and secondly for the parallel implementation.

Now suppose we perform inference on many objects with part sharing using a serial computer.
This requires performing computations over the part-subpart compositions between elements of the
dictionaries. At level $h$ there are $|\mathcal{M}_h|$ dictionary elements. Each can take $|\mathcal{D}_h| = q^h |\mathcal{D}_0|$ possible
states. The bottom-up pass requires performing $C_r$ computations for each of them. This gives a
total of $\sum_{h=1}^{H} |\mathcal{M}_h| C_r |\mathcal{D}_0| q^h = |\mathcal{D}_0| C_r \sum_{h=1}^{H} |\mathcal{M}_h| q^h$ computations for the bottom-up process.
The complexity of model selection is $|\mathcal{D}_0| q^{H} \times (H + 1)$ (this is between all the objects, and the
background model, at all points on the top lattice). As in the previous section, the complexity of the
top-down process is less than the complexity of the bottom-up process. Hence the complexity for
multiple objects using part sharing is given by:

$$N_{ps} = |\mathcal{D}_0| C_r \sum_{h=1}^{H} |\mathcal{M}_h| q^h. \tag{13}$$
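
The serial cost with part sharing is a weighted sum of the dictionary sizes, which can be sketched directly. The dictionary sizes and parameters below are hypothetical and serve only to make the arithmetic concrete; they are not taken from the experiments.

```python
from typing import List, Sequence


def shared_cost_per_level(d0: int, c_r: int, q: float, dict_sizes: Sequence[int]) -> List[float]:
    """Bottom-up cost at each level h = 1..H: |D_0| * C_r * |M_h| * q^h.

    dict_sizes[h - 1] holds |M_h|, the number of dictionary elements at level h.
    """
    return [d0 * c_r * m * q**h for h, m in enumerate(dict_sizes, start=1)]


def serial_shared_cost(d0: int, c_r: int, q: float, dict_sizes: Sequence[int]) -> float:
    """N_ps = |D_0| * C_r * sum_h |M_h| * q^h."""
    return sum(shared_cost_per_level(d0, c_r, q, dict_sizes))


# Hypothetical example: |D_0| = 1024, C_r = 4, q = 1/4, dictionaries of size 8, 16, 32.
levels = shared_cost_per_level(1024, 4, 0.25, [8, 16, 32])
total = serial_shared_cost(1024, 4, 0.25, [8, 16, 32])
```

In this example the dictionaries double in size at each level while the lattice shrinks by a factor of four, so the per-level cost halves as $h$ increases.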

Next consider the parallel implementation. In this case almost all of the computations are performed
in parallel and so the complexity is now expressed in terms of the number of "neurons" required to
encode the dictionaries, see figure (5). This is specified by the total number of dictionary elements
multiplied by the number of spatial copies of them:

$$N_n = \sum_{h=1}^{H} |\mathcal{M}_h| q^h |\mathcal{D}_0|. \tag{14}$$

The computation time, for both the forward and backward passes of dynamic programming, is linear in
the number $H$ of levels. We only need to perform the computations illustrated in figure (6) between all
adjacent levels.

Hence the parallel implementation gives speed which is linear in $H$ at the cost of a possibly large
number $N_n$ of "neurons" and connections between them.
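
The trade-off between node count and time in the parallel implementation can be illustrated with a small sketch. It assumes, for illustration only, that each pass takes one time step per level; the dictionary sizes and function names are hypothetical.

```python
from typing import Sequence


def neuron_count(d0: int, q: float, dict_sizes: Sequence[int]) -> float:
    """N_n = sum_h |M_h| * q^h * |D_0|: one node per dictionary element per lattice position.

    dict_sizes[h - 1] holds |M_h| for h = 1..H.
    """
    return sum(m * q**h * d0 for h, m in enumerate(dict_sizes, start=1))


def sequential_steps(num_levels: int) -> int:
    """ASSUMPTION: one time step per level for the bottom-up pass and one for the top-down pass."""
    return 2 * num_levels


# Hypothetical example: |D_0| = 1024, q = 1/4, dictionaries of size 8, 16, 32.
sizes = [8, 16, 32]
neurons = neuron_count(1024, 0.25, sizes)
steps = sequential_steps(len(sizes))
```

The step count depends only on the number of levels, whereas the node count depends on the dictionary sizes and the lattice sizes.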

4.3 Advantages of Part Sharing in Different Regimes 

The advantages of part-sharing depend on how the number of parts $|\mathcal{M}_h|$ scales with the level $h$
of the hierarchy. In this section we consider three different regimes: (I) The exponential growth
regime where the size of the dictionaries increases exponentially with the level $h$. (II) The empirical
growth regime where we use the size of the dictionaries found experimentally by compositional
learning [19]. (III) The exponential decrease regime where the size of the dictionaries decreases
exponentially with level $h$. For all these regimes we compare the advantages of the serial and parallel
implementations using part sharing by comparison to the complexity results without sharing.

Exponential growth of dictionaries is a natural regime to consider. It occurs when subparts are 
allowed to combine with all other subparts (or a large fraction of them) which means that the number 
of part-subpart compositions is polynomial in the number of subparts. This gives exponential growth 
in the size of the dictionaries if it occurs at different levels (e.g., consider the enormous number of
objects that can be built using Lego).

An interesting special case of the exponential growth regime is when $|\mathcal{M}_h|$ scales like $1/q^h$, see
figure (7) (left panel). In this case the complexity of computation for serial part-sharing, and the
number of neurons required for parallel implementation, scales only with the number of levels $H$.
This follows from equations (13, 14). But nevertheless the number of objects that can be detected
scales exponentially as $1/q^{H}$. By contrast, the complexity of inference without part-sharing scales
exponentially with $H$, see equation (12), because we have to perform a fixed number of computations,
given by equation (11), for each of an exponential number of objects. This is summarized by the
following result.

Result 1: If the number of shared parts scales exponentially by $|\mathcal{M}_h| \propto 1/q^h$ then we can perform
inference for order $1/q^{H}$ objects using part sharing in time linear in $H$, or with a number of
neurons linear in $H$ for parallel implementation. By contrast, inference without part-sharing requires
exponential complexity.
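
A small numerical sketch of this regime is given below. It assumes the proportionality constant is one, i.e. $|\mathcal{M}_h| = (1/q)^h$, and uses hypothetical parameter values; the unshared cost follows the form of equation (12).

```python
def shared_serial_cost(d0: int, c_r: int, q: float, num_levels: int) -> float:
    """N_ps with |M_h| = (1/q)^h: each level contributes |D_0| * C_r, since |M_h| * q^h = 1."""
    return d0 * c_r * sum((1 / q) ** h * q**h for h in range(1, num_levels + 1))


def unshared_cost(d0: int, c_r: int, q: float, r: int, num_levels: int) -> float:
    """N_mo = |M_H| * |D_0| * C_r * q * r^(H-1) / (1 - q/r), with |M_H| = (1/q)^H objects."""
    num_objects = (1 / q) ** num_levels
    return num_objects * d0 * c_r * q * r ** (num_levels - 1) / (1 - q / r)


# Hypothetical parameters: |D_0| = 1024, C_r = 4, q = 1/2, r = 2.
D0, C_R, Q, R = 1024, 4, 0.5, 2
cost_10 = shared_serial_cost(D0, C_R, Q, 10)  # 10 levels, 2^10 objects
cost_20 = shared_serial_cost(D0, C_R, Q, 20)  # 20 levels, 2^20 objects
ratio_10 = unshared_cost(D0, C_R, Q, R, 10) / cost_10
```

Doubling the number of levels doubles the shared cost while squaring the number of detectable objects, whereas the unshared cost at ten levels is already several orders of magnitude larger.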

To what extent is exponential growth a reasonable assumption for real world objects? This motivates
us to study the empirical growth regime using the dictionaries obtained by the compositional learning
experiments reported in [19]. In these experiments, the size of the dictionaries increased rapidly at
the lower levels (i.e. small $h$) and then decreased at higher levels (roughly consistent with the
findings of psychophysical studies - Biederman, personal communication). For these "empirical
dictionaries" we plot the growth, and the number of computations at each level of the hierarchy, in
figure (7) (center panel). This shows complexity which roughly agrees with the exponential growth
model. This can be summarized by the following result:

Result 2: If $|\mathcal{M}_h|$ grows slower than $1/q^h$ and if $|\mathcal{M}_h| < r^{H-h}$ then there are gains due to part
sharing using serial and parallel computers. This is illustrated in figure (7) (center panel) based on
the dictionaries found by unsupervised compositional learning [19]. In parallel implementations,
computation is linear in $H$ while requiring a limited number of nodes ("neurons").

Finally we consider the exponential decrease regime. To motivate this regime, suppose that the
dictionaries are used to model image appearance, in contrast to the dictionaries based on geometrical
features such as bars and oriented edges (as used in [19]). It is reasonable to assume that there are a
large number of low-level dictionaries used to model the enormous variety of local intensity patterns.
The number of higher-level dictionaries can decrease because they can be used to capture a cruder
summary description of a larger image region, which is another instance of the executive summary
principle. For example, the low-level dictionaries could be used to provide detailed modeling of
the local appearance of a cat, or some other animal, while the higher-level dictionaries could give
simpler descriptions like "cat-fur" or "dog-fur" or simply "fur". In this case, it is plausible that the
size of the dictionaries decreases exponentially with the level $h$. The results for this case emphasize
the advantages of parallel computing.

Result 3: If $|\mathcal{M}_h| = r^{H-h}$ then there is no gain for part sharing if serial computers are used, see
figure (7) (right panel). Parallel implementations can do inference in time which is linear in $H$ but
require an exponential number of nodes ("neurons").
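
To see why serial part sharing gives no gain in this regime, one can substitute $|\mathcal{M}_h| = r^{H-h}$ into the serial and parallel costs of equations (13) and (14). A short worked derivation under that assumption:

```latex
N_{ps} = |\mathcal{D}_0| C_r \sum_{h=1}^{H} r^{H-h} q^{h}
       = |\mathcal{D}_0| C_r r^{H} \sum_{h=1}^{H} (q/r)^{h} = N_{bu},
\qquad
N_{n} = |\mathcal{D}_0| r^{H} \sum_{h=1}^{H} (q/r)^{h}
      \approx |\mathcal{D}_0| \frac{q\, r^{H-1}}{1 - q/r}.
```

The serial cost coincides with the bottom-up cost of equation (9) for a single object without sharing, while the neuron count grows like $r^{H-1}$, i.e. exponentially in the number of levels.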

Result 3 may appear negative at first glance, even for the parallel version, since an exponentially
large number of neurons is required to encode the lower-level dictionaries. But it may relate
to one of the more surprising facts about the visual cortex in monkeys and humans, namely that the
first two visual areas, V1 and V2, where low-level dictionaries would be implemented, are enormous
compared to the higher levels such as IT where object detection takes place. Current models of V1
and V2 mostly relegate them to being a large filter bank, which seems paradoxical considering their size.
Figure 7: The curves are plotted as a function of $h$. Left panel: The first plot is the case where
$|\mathcal{M}_h|$ scales like $1/q^h$, so we have a constant cost for the computations when we have shared parts.
Center panel: This plot is based on the experiment of [19]. Right panel: The third plot is the case where
$|\mathcal{M}_h|$ decreases exponentially. The amount of computation is the same for the shared and non-shared
cases. A set of plots with different values of $r$.



For example, as one theorist [12] has stated when reviewing the functions of V1 and V2, "perhaps
the most troublesome objection to the picture I have delivered is that an enormous amount of cortex
is used to achieve remarkably little". Our complexity studies suggest a reason why these visual areas
may be so large if they are used to encode dictionaries.



5 Discussion 



This paper provides a complexity analysis of what is arguably one of the most fundamental problems
of vision: how a biological or artificial vision system could rapidly detect and recognize an
enormous number of different objects. We focus on a class of hierarchical compositional models [20, 19]
whose formulation makes it possible to perform this analysis. But we conjecture that similar results
will apply to related hierarchical models of vision (e.g., those cited in the introduction).

Technically this paper has required us to re-formulate compositional models so that they can be
defined on regular lattices (which makes them easier to compare to alternatives such as deep belief
networks) and to develop a novel parallel implementation. Hopefully the analysis has also clarified
the use of part-sharing to perform exact inference even on highly complex models, which may not have
been clear in the original publications. We note that the re-use of computations in this manner might
relate to methods developed to speed up inference on graphical models, which gives an interesting
direction to explore.

Finally, we note that the parallel inference algorithms used by this class of compositional models
have an interesting interpretation in terms of the bottom-up versus top-down debate concerning
processing in the visual cortex [4]. The algorithms have rapid parallel inference, in time which is linear
in the number of layers, and which rapidly estimates a coarse "executive summary" interpretation
of the image. The full interpretation of the image takes longer and requires a top-down pass where
the high-level context is able to resolve ambiguities which occur at the lower levels. Of course, for
some simple images the local evidence for the low level parts is sufficient to detect the parts in the
bottom-up pass and so the top-down pass is not needed. But more generally, in the bottom-up pass
the neurons are very active and represent a large number of possible hypotheses which are pruned
out during the top-down pass using context, when "high-level tells low-level to stop gossiping".